In univariate OLS regression, we are predicting a single outcome variable from a single predictor variable.
In our example, we will use the USJudgeRatings dataset.
This dataset contains ratings of 43 state judges on 12 dimensions.
# Load some packages in
library(estimatr)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(plotly)
##
## Attaching package: 'plotly'
##
## The following object is masked from 'package:ggplot2':
##
## last_plot
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following object is masked from 'package:graphics':
##
## layout
# Load in the dataset that we will use
require(graphics)
head(USJudgeRatings)
We are going to start out with a simple linear regression model with
one predictor. We will predict RTEN (Worthy of retention)
from CONT (Number of contacts of lawyer with judge).
# Fit the linear regression model
model_1_predictor <- lm(RTEN ~ INTG, data = USJudgeRatings)
# Create the plot
ggplot(USJudgeRatings, aes(x = INTG, y = RTEN)) +
geom_point() + # Add data points
geom_smooth(method = "lm", se = FALSE, color = "blue") + # Add regression line
geom_segment(aes(xend = INTG, yend = fitted(model_1_predictor)),
color = "red", alpha = 0.5) + # Add residuals
labs(title = "OLS Regression with 1 Predictor",
x = "Intergrity rating (INTG)",
y = "Worthy of retention (RTEN)",
caption = "Red lines represent residuals") +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
tidy(lm_robust(RTEN ~ INTG, data = USJudgeRatings))
In multiple OLS regression, we are predicting a single outcome variable from multiple predictor variables.
# Fit the linear regression model
model <- lm(RTEN ~ INTG + FAMI, data = USJudgeRatings)
# Create a grid of predictor values for INTG and FAMI
INTG_seq <- seq(min(USJudgeRatings$INTG), max(USJudgeRatings$INTG), length.out = 50)
FAMI_seq <- seq(min(USJudgeRatings$FAMI), max(USJudgeRatings$FAMI), length.out = 50)
prediction_grid <- expand.grid(INTG = INTG_seq, FAMI = FAMI_seq)
prediction_grid$RTEN <- predict(model, newdata = prediction_grid)
# Reshape predictions into a matrix for plotly's surface
# Rows correspond to FAMI and columns to INTG.
z_matrix <- matrix(prediction_grid$RTEN,
nrow = length(FAMI_seq),
ncol = length(INTG_seq))
# Create the interactive 3D plot with the regression surface and data points
fig <- plot_ly() %>%
add_surface(x = INTG_seq, y = FAMI_seq, z = z_matrix,
opacity = 0.5, showscale = FALSE,
name = 'Regression Surface') %>%
add_markers(data = USJudgeRatings,
x = ~INTG, y = ~FAMI, z = ~RTEN,
marker = list(color = 'blue', size = 3),
name = 'Data Points')
# Compute residuals: lines from the fitted value at (INTG, FAMI) to the observed RTEN
n <- nrow(USJudgeRatings)
# Prepare vectors for line segments: for each data point, we create a segment:
# (INTG, FAMI, fitted) --> (INTG, FAMI, observed)
x_lines <- rep(NA, 3 * n)
y_lines <- rep(NA, 3 * n)
z_lines <- rep(NA, 3 * n)
# For each observation, fill the vectors with two points and an NA to break the line
x_lines[seq(1, 3*n, by = 3)] <- USJudgeRatings$INTG # starting x (fitted)
x_lines[seq(2, 3*n, by = 3)] <- USJudgeRatings$INTG # ending x (observed)
x_lines[seq(3, 3*n, by = 3)] <- NA # break
y_lines[seq(1, 3*n, by = 3)] <- USJudgeRatings$FAMI # starting y (fitted)
y_lines[seq(2, 3*n, by = 3)] <- USJudgeRatings$FAMI # ending y (observed)
y_lines[seq(3, 3*n, by = 3)] <- NA # break
z_lines[seq(1, 3*n, by = 3)] <- model$fitted.values # starting z (fitted value)
z_lines[seq(2, 3*n, by = 3)] <- USJudgeRatings$RTEN # ending z (observed)
z_lines[seq(3, 3*n, by = 3)] <- NA # break
# Add the residual lines to the plot
fig <- fig %>%
add_trace(x = x_lines, y = y_lines, z = z_lines,
type = 'scatter3d', mode = 'lines',
line = list(color = 'black', width = 2),
name = "Residuals") %>%
layout(title = "OLS regression with 2 predictor variables",
scene = list(xaxis = list(title = "INTG: judicial integrity"),
yaxis = list(title = "FAMI: familiarity with law"),
zaxis = list(title = "RTEN: worthy of retention")))
fig
# Run the OLS regression with robust standard errors
model_robust_se <- lm_robust(RTEN ~ INTG + FAMI, data = USJudgeRatings)
tidy(model_robust_se)
For this question, we will use the “2011 Canadian National Election Study, With Attitude Toward Abortion” dataset, available here: https://vincentarelbundock.github.io/Rdatasets/csv/carData/CES11.csv
The documentation for the dataset is here: https://vincentarelbundock.github.io/Rdatasets/doc/carData/CES11.html
We run a logistic regression to predict whether a respondent in the survey thinks that abortion should be banned based on the following variables:
We output the results with odds ratios and 95% confidence intervals generated from robust standard errors.
# Calculate robust standard errors using sandwich estimator
library(sandwich)
library(lmtest)
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
library(broom)
library(modelsummary)
# Load in the dataset
df <- read.csv("https://vincentarelbundock.github.io/Rdatasets/csv/carData/CES11.csv")
# Recoding varibles
# gender
df$gender_female_c <- ifelse(df$gender == "Female", 1, 0)
table(df$gender, df$gender_female_c)
##
## 0 1
## Female 0 1244
## Male 987 0
# importance of religion
df$importance_c <- NA
df$importance_c[df$importance == "not"] <- 0
df$importance_c[df$importance == "notvery"] <- 1
df$importance_c[df$importance == "somewhat"] <- 2
df$importance_c[df$importance == "very"] <- 3
# check that we re-coded correctly
table(df$importance, df$importance_c)
##
## 0 1 2 3
## not 607 0 0 0
## notvery 0 315 0 0
## somewhat 0 0 714 0
## very 0 0 0 595
# education
# a factor with (alphabetical) levels bachelors (Bachelors degree), college (community college or technical school), higher (graduate degree), HS (high-school graduate), lessHS (less than high-school graduate), somePS (some post-secondary).
# We are ordering this factor so that the levels are in the order of education level
df$education_c <- factor(df$education,
levels = c("lessHS", "HS", "somePS",
"college", "bachelors", "higher"))
table(df$education_c)
##
## lessHS HS somePS college bachelors higher
## 267 467 254 491 506 246
# urban/rural
df$urban_c <- ifelse(df$urban == "urban", 1, 0)
# abortion
df$abortion_c <- ifelse(df$abortion == "Yes", 1, 0)
table(df$abortion_c)
##
## 0 1
## 1818 413
# Run the logistic regression
md <- glm(formula = abortion_c ~ gender_female_c + importance_c +
education_c + urban_c, data = df, family = "binomial")
coeftest(md, vcov = vcovHC(md, type = "HC3"))
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.929894 0.283108 -10.3490 < 2.2e-16 ***
## gender_female_c -0.349390 0.123087 -2.8386 0.0045318 **
## importance_c 1.174634 0.091364 12.8567 < 2.2e-16 ***
## education_cHS -0.340402 0.189507 -1.7962 0.0724552 .
## education_csomePS -0.630924 0.229971 -2.7435 0.0060789 **
## education_ccollege -0.500686 0.195072 -2.5667 0.0102679 *
## education_cbachelors -0.855758 0.201520 -4.2465 2.171e-05 ***
## education_chigher -0.879156 0.258249 -3.4043 0.0006633 ***
## urban_c -0.302436 0.130887 -2.3107 0.0208518 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Get model results with confidence intervals
# Odds ratio: need to set exponentiate = TRUE
# Add in robust standard errors by vcov = vcovHC(model, type = "HC3")
md_results <- tidy(md, vcov = vcovHC(md, type = "HC3"),
conf.int = TRUE, conf.level = 0.95, exponentiate = TRUE)
md_results
Write a brief interpretation of the results above (~150 words) in terms of the odds ratios and their confidence intervals.
The logistic regression results indicate that respondents’ demographic and religious characteristics are meaningfully associated with their opinions on whether abortion should be banned. Adjusting for other variables:
As someone’s self-reported importance of random goes up 1 point on the scale, they are 3.24 times more likely to oppose abortion (\(p\)-value <0.001).
Female respondents are 0.705 times less likely to support a ban on abortion compared to male respondents (\(p\)-value less than 0.01).
In terms of education, each increased level of education (compared with the reference category of less than high school) is associated with lower odds of supporting a ban on abortion.
Respondents living in urban areas are 0.739 times less likely to support a ban on abortion compared to those living in rural areas (\(p\)-value <0.05).
You can use the logistic regression model to predict the probability of a respondent supporting a ban on abortion based on their demographic and religious characteristics.
# Get the predicted probability for all the respondents in the dataset
# Make sure you set the type to "response" to get the probability of the outcome
all_respondents_predict <- predict(md, newdata = df, type = "response")
# Make a histogram of the predicted probabilities
hist(all_respondents_predict, breaks = 20,
main = "Predicted probabilities of supporting a ban on abortion",
xlab = "Predicted probability", ylab = "Frequency")
# Get the predicted probability for a specific respondent
# This respondent: lives in urban area, indicates that religion is not very important to them,
# has a high school education, is male
# education_c = "HS"
predict(md, newdata = data.frame(urban_c = 1,
importance_c = 1,
education_c = "HS",
gender_female_c = 0),
type = "response")
## 1
## 0.08331787